Code
stopifnot(
require(tidyverse),
require(patchwork),
require(httr),
require(glue),
require(broom)
)
old_theme <- theme_set(theme_minimal())stopifnot(
require(tidyverse),
require(patchwork),
require(httr),
require(glue),
require(broom)
)
old_theme <- theme_set(theme_minimal())lm_1lm_1Environment panel (Rstudio), to explore the structure of lm_1. Try to understand the signification of each element.datapath <- '../DATA'
fname <- 'Banque.csv'
fpath <- paste(datapath, fname, sep="/")if (!file.exists(fpath)) {
baseurl <- 'https://stephane-v-boucheron.fr/data'
download.file(url=paste(baseurl, fname, sep="/"),
destfile=fpath)
print(glue::glue('File {fname} downloaded at {fpath}!'))
} else {
print(glue::glue('File {fname} already exists at {fpath}!'))
}Warning in download.file(url = paste(baseurl, fname, sep = "/"), destfile =
fpath): URL https://stephane-v-boucheron.fr/data/Banque.csv: cannot open
destfile './DATA/Banque.csv', reason 'No such file or directory'
Warning in download.file(url = paste(baseurl, fname, sep = "/"), destfile =
fpath): download had nonzero exit status
File Banque.csv downloaded at ./DATA/Banque.csv!
bank <- readr::read_table(fpath,
col_types = cols(
SEXE = col_factor(levels = c("0", "1")),
CATEGORIE = col_integer(),
NB_ETUDES = col_integer(),
SATIS_EMPLOI = col_factor(levels = c("non", "oui")),
SATIS_CHEF = col_factor(levels = c("non", "oui")),
SATIS_SALAIRE = col_factor(levels = c("non", "oui")),
SATIS_COLLEGUES = col_factor(levels = c("non", "oui")),
SATIS_CE = col_factor(levels = c("non", "oui"))
)
)broom::tidy()broom::glance()broom::augment()The output of augment may be described as adding 6 columns to dataframe bank. The six columns are built using items from lm_1. Can you explain their meaning and why they are relevant to diagnosing?
Let base R produce diagnostic plots
plot(lm_1, which = 1:6)We will reproduce (and discuss) four of the six diagnostic plots provided by the plot method from base R (1,2,3,5).
ggplot using the aumented version of lm_1 (augment(lm_1)).Comment Diagnostic Plot 1.
Compute the correlation coefficient between residuals and fitted values.
Make your graphic pipeline a reusable function.
ggplot.Make your graphic pipeline a reusable function.
ggplot.In the second diagnostic plot (the residuals qqplot), we build a quantile-quantile plot by plotting function \(F_n^{\leftarrow} \circ \Phi\) where \(\Phi\) is the ECDF of the standard Gaussian distribution while \(F^\leftarrow_n\).
ggplotpatchwork::... to collect your four diagnostic plotsSAL_ACTUELAGE and SAL_ACTUELRedo the above described steps and call the model lm_2.
ggplot programming : write a function with arguments df, varx and vary where varx and vary are two strings denoting numerical columns in df, that outputs a ggplot object made of a scatterplot of columns vary and vary, a linear regression of vary against varx. The ggplot plot object should be annotated with the linear correlation coefficient of vary and varx and equipped with a title.bank %>%
ggplot() +
aes(x=AGE, y=SAL_ACTUEL) +
geom_point(alpha=.5, size=.5, ) +
geom_smooth(method="lm", formula= y ~ x, se=F) +
annotate(geom="text", x=60, y=60000,
label=str_c("rho = ",
round(cor(bank$SAL_ACTUEL, bank$AGE), 2))) +
ggtitle("Bank dataset")ggplot_lin_reg <- function(df, varx, vary){
rho <- round(cor(df[[varx]], df[[vary]]), 2)
posx <- sum(range(df[[varx]])*c(.25 , .75))
posy <- sum(range(df[[vary]])*c(.5 , .5))
df %>%
ggplot() +
aes(x=.data[[varx]], y=.data[[vary]]) +
geom_point(alpha=.5, size=.5, ) +
geom_smooth(method="lm", formula= y ~ x, se=F) +
annotate(geom="text", x=posx, y=posy,
label=glue("Linear Cor. Coeff.: {rho}")) +
ggtitle(glue("Linear regression: {vary} ~ {varx}"))
}
ggplot_lin_reg(bank, "AGE", "SAL_ACTUEL")Inspect rows with high Cook’s distance
SAL_ACTUEL and AGESAL_ACTUEL as a function of age AGETo perform linear fitting, we choose \(450\) points amongst the \(474\) sample points: the \(24\) remaining points are used to assess the merits of the linear fit.
banque dataframe.Function sample from base R is convenient. You may also enjoy slice_sample() from dplyr. Denote by trainset the vector of of selected indices. Bind the vector of left behind indices to variable testset. Functions match, setdiff or operator %in% may be useful.
Inspecting points with high Cook’s distance
\[ \begin{pmatrix} Y \end{pmatrix} = \begin{pmatrix} \mathbb{Z} \end{pmatrix} \times \beta + \sigma \begin{pmatrix} \epsilon \end{pmatrix} \]